1  Review of Optimization basics

1.1 First definitions

1.1.1 Why we need Calculus

Suppose you have a function f(x) and you want to find the value of x that makes f(x) as large as possible. That largest value is called a maximum, and an x that attains it is called a maximizer.

Our problem is thus to find various forms of maxima and minima. We denote these problems as follows:

| Number | Objective name | Mathematical expression | Key elements |
|---|---|---|---|
| (1) | Unconstrained maximization | \max_x f(x) | The only restriction on f is the nature of its admissible inputs \mathbf{dom}(f) |
| (2) | Finding the global maximizer | x^\star = \argmax_x f(x) | A more restrictive set of assumptions than in (1) is generally needed to pin down the precise x^\star at which f is maximized |
| (3) | Finding a maximizer within a constraint set | \max_{x\in\mathcal{X}} f(x) | Here x cannot leave a predefined set \mathcal{X}; this is very useful in real-life applications |
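To make this concrete, here is a minimal sketch of the three forms on a toy objective f(x) = -(x-2)^2 + 1 (this is my own illustration, assuming scipy is available; scipy only minimizes, so we maximize f by minimizing -f):

Code
from scipy.optimize import minimize_scalar

f = lambda x: -(x - 2) ** 2 + 1   # toy objective, maximized at x = 2

# (1) Unconstrained maximization: maximize f by minimizing -f
res = minimize_scalar(lambda x: -f(x))
print("max_x f(x)    =", -res.fun)   # the maximum value, here 1

# (2) The global maximizer x*: the point where the maximum is attained
print("argmax_x f(x) =", res.x)      # here 2

# (3) Constrained maximization over X = [3, 5]: x cannot leave the set
res_c = minimize_scalar(lambda x: -f(x), bounds=(3, 5), method="bounded")
print("max over [3, 5] is attained at x =", res_c.x)   # boundary point x = 3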

When is calculus useful, and when is it not? Here is a first criterion: the domain \mathcal{X} of the input variable, or (which is often the same thing) the domain \mathbf{dom}(f) of our function.

A pretty important part of the problem is what we call the domain of the function. This is the set of values that x can take on. For example, if f(x) = x^2, then the domain is the set of real numbers \mathbb{R}.

Domains change everything because, for example, if the domain \mathbf{dom}(f) is discrete and finite, then the image \mathbf{im}_f (also denoted f(\mathbf{dom}(f))) is discrete and finite too, and maximization boils down to an array problem, for which efficient algorithms like quicksort exist.
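For instance, here is a tiny sketch (my own toy example) of that discrete situation: with a finite domain, maximization is just picking the largest entry of an array, with sorting as an overkill-but-valid alternative.

Code
f = lambda x: -(x - 3) ** 2          # toy objective on a discrete domain
domain = range(-10, 11)              # a discrete, finite dom(f)

values = [f(x) for x in domain]      # the image f(dom(f)) is discrete too
best_x = max(domain, key=f)          # argmax found by a single scan
best_f = max(values)                 # maximum found by a single scan

# The "array-sorting" view: sort the values and read off the largest one
sorted_pairs = sorted(zip(values, domain))
assert sorted_pairs[-1] == (best_f, best_x)
print(best_x, best_f)                # 3 0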

The derivative of a function f\colon \mathbb{R}\to\mathbb{R} is the function f'(x) everyone knows, namely the limit

f'(x) = \lim_{h\to 0} \frac{f(x+h) - f(x)}{h}
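If you want to see this limit at work, here is a small sketch (assuming sympy, which we use again later in these notes) that builds the difference quotient symbolically and takes the limit h \to 0:

Code
import sympy as sp

x, h = sp.symbols("x h")
f = sp.sin(x) * sp.exp(x)

# The difference quotient, then the limit h -> 0 that defines f'(x)
quotient = (f.subs(x, x + h) - f) / h
derivative = sp.limit(quotient, h, 0)

print(derivative)                                # exp(x)*sin(x) + exp(x)*cos(x)
print(sp.simplify(derivative - sp.diff(f, x)))   # 0: agrees with sympy's diff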

When you have a function of several variables, say f(x_1, x_2, \dots, x_n), whose values are still real numbers, you can behave as if nothing had happened. Fix the other n-1 variables at values

\bar{x}_{\scriptscriptstyle-i} \coloneqq (\bar{x}_1,\ldots,\bar{x}_{\scriptscriptstyle i-1},\bar{x}_{\scriptscriptstyle i+1},\dots,\bar{x}_n)

Then the univariate function

g\colon x \mapsto f(\bar{x}_{\scriptscriptstyle-i}, x) \coloneqq f(\bar{x}_1,\ldots,\bar{x}_{\scriptscriptstyle i-1},x,\bar{x}_{\scriptscriptstyle i+1},\dots,\bar{x}_n)

is a function of a single variable, and you can still take its derivative.

We have a bunch of fancy notation for this derivative, namely \partial_i f(\bar{x}) for short, or \frac{\partial f}{\partial x_i}(\bar{x}). We also sometimes speak of the directional derivative of f in the direction of a vector v, often denoted D_{v}f; the partial derivative \partial_i f is just the special case where v is the ith coordinate direction e_i.
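Here is a quick sketch (my own toy example with sympy) showing that a partial derivative really is just the ordinary derivative of the frozen, univariate function g:

Code
import sympy as sp

x1, x2, x3, t = sp.symbols("x1 x2 x3 t")
f = x1**2 * x2 + sp.sin(x3)

# Freeze x1 = 1 and x3 = 2; the second slot stays free: g(t) = f(1, t, 2)
g = f.subs({x1: 1, x3: 2}).subs(x2, t)
print(sp.diff(g, t))                             # 1, the ordinary derivative of g

# The same number, computed directly as a partial derivative of f
print(sp.diff(f, x2).subs({x1: 1, x3: 2}))       # 1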

What is not clear, though, is how these derivatives relate to one another. How does \partial_i f(\bar{x}) relate to \partial_j f(\bar{x})?

There is a good theorem to explain this!

1.1.2 The Jacobian

The theorem that defines and ties together all of these derivatives is the following:

Let f\colon \mathbb{R}^n \to \mathbb{R}^m be a function. Then, at each point x where f is differentiable, there is a unique matrix M(x) such that f(x+h) = f(x) + M(x)h +\varepsilon (h), where \varepsilon (h) is a small vector in \mathbb{R}^m that goes to 0 faster than h does¹.
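Before proving it, we can sanity-check the statement numerically. The sketch below is my own example using numpy: it takes f(x, y) = (xy, \sin x), plugs in a candidate matrix M(x) (whose entries are the partial derivatives, as the proof below will show), and watches \lVert\varepsilon(h)\rVert/\lVert h\rVert shrink with h.

Code
import numpy as np

def f(v):
    x, y = v
    return np.array([x * y, np.sin(x)])

x0 = np.array([1.0, 2.0])

# Candidate M(x0): the partial derivatives [[y, x], [cos(x), 0]] at (1, 2)
M = np.array([[2.0, 1.0],
              [np.cos(1.0), 0.0]])

# epsilon(h) = f(x0 + h) - f(x0) - M h should vanish faster than h does
for scale in [1e-1, 1e-2, 1e-3, 1e-4]:
    h = scale * np.array([0.3, -0.7])
    eps = f(x0 + h) - f(x0) - M @ h
    print(np.linalg.norm(eps) / np.linalg.norm(h))   # shrinks roughly like |h|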

1.1.3 The Proof

The uniqueness is fairly easy to prove. If there are two matrices M_1 and M_2 such that f(x+h) = f(x) + M_1h + \varepsilon_1(h) and f(x+h) = f(x) + M_2h + \varepsilon_2(h), then we can subtract the two equations to get (M_1 - M_2)h = \varepsilon_2 (h) - \varepsilon_1 (h)

Now take h = tu for a fixed unit vector u and divide by t: the LHS is always (M_1 - M_2)u, while the RHS tends to 0 as t\to 0. Hence (M_1 - M_2)u = 0 for every direction u, so M_1 - M_2 = 0.

The existence essentially comes from viewing matrices as rectangular arrays of numbers. If the matrix M(x) has coefficients \left(\smash{\alpha_{ji}(x)}\right)_{j,i} (row j, column i), then our vector equation is merely a stack of statements that are 1-D (which we know how to deal with):

\begin{aligned} f(x+h) &= f(x) + M(x)h +\varepsilon (h) &\\ \iff f_j(x+h) &= f_j(x) + \sum_{i=1}^n \alpha_{ji}(x)\,h_i + \varepsilon_j (h) &\forall j\in\{1,\dots, m\} \end{aligned}

and, specializing to h = h_i e_i (moving along one coordinate direction at a time),

f_j(x+h_i e_i) = f_j(x) + \alpha_{ji}(x)\,h_i + \varepsilon_j (h_i e_i) \qquad \forall j\in\{1,\dots, m\},\,\forall i\in\{1,\dots, n\}

This last set of equations is just an m\times n grid of first-order approximations, one for each f_j in each coordinate direction x_i, and we can solve for the \alpha_{ji} by taking the limit as h_i\to 0 for each i².

If you hadn’t guessed before, each of these equations looks very much like the familiar 1-D statement \varphi(x+h) = \varphi(x) + \varphi'(x)h + \varepsilon(h)

so we can just take the limit as h_i\to 0 to get that \alpha_{ji}(x) = \dfrac{\partial f_j}{\partial x_i}(x).

What this proves is 3 things:

  1. The Jacobian is a matrix, and it is unique.
  2. The Jacobian is a linear operator, and it is continuous.
  3. The Jacobian matrix is populated by the partial derivatives of f, that is

\textbf{Jac}[f](x) = \left[\dfrac{\partial f_j}{\partial x_i}(x)\right]_{\substack{1\leq j \leq m\\1\leq i \leq n}}

Careful about the dimensions!

The function f sends points of an n-dimensional space into an m-dimensional space, so the Jacobian matrix is an m\times n matrix. This is why the partial derivatives are written in the order \dfrac{\partial f_j}{\partial x_i}: the jth row of the matrix holds the partial derivatives of f_j with respect to the x_i’s.

This can be counterintuitive to anyone used to working with the gradient \nabla f(x) of a scalar-valued function (the case m=1), which is an n-dimensional column vector and is the transpose of the 1\times n Jacobian matrix.
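A quick dimension check with sympy (my own example) makes the convention concrete: for f\colon \mathbb{R}^3 \to \mathbb{R}^2 the Jacobian is 2\times 3, and for a scalar function the gradient is the transpose of its 1\times n Jacobian.

Code
import sympy as sp

x, y, z = sp.symbols("x y z")

# f : R^3 -> R^2, so m = 2 outputs and n = 3 inputs
f = sp.Matrix([x * y + z, sp.sin(x) * z])
J = f.jacobian([x, y, z])
print(J.shape)                        # (2, 3): an m x n matrix, row j holds df_j/dx_i

# Scalar case m = 1: the gradient is the transpose of the 1 x n Jacobian
g = sp.Matrix([x * y + z])
print(g.jacobian([x, y, z]))          # Matrix([[y, x, 1]]), a 1 x 3 row
print(g.jacobian([x, y, z]).T)        # its transpose is the column vector grad f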

1.1.4 A small example using Python

Let’s take a look at a small example I made using Python’s sympy library. By graphing increasingly complex 2D functions on an online tool, I found a function that is complex enough for us to have fun, but not so complex that we can’t understand it.

The function is:

G(x,y) = \log\left[1+\left(x^{\frac{1}{3}} - y^{\frac{2}{3}}\right)^2\left(0.1+\sin \left(x\right)^2\cdot \cos \left(x\right)^2\right)^{-1}\right]

But I think it’s easier to understand if we write it as:

G(x,y) = \log\left(1 + \dfrac{A(x,y)^2}{\gamma + B(x,y)^2}\right)

with A(x,y) = x^{\frac{1}{3}} - y^{\frac{2}{3}}, B(x,y) = \sin \left(x\right)\cdot \cos \left(x\right), and \gamma = 0.1.
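If you want to play with it symbolically, here is a small sketch (my own, using sympy, with \gamma = 0.1 as in the first formula) that builds G from A and B and asks for a partial derivative:

Code
import sympy as sp

x, y = sp.symbols("x y", positive=True)   # positive symbols keep x**(1/3), y**(2/3) real
gamma = sp.Rational(1, 10)                # the 0.1 appearing in the first formula

A = x ** sp.Rational(1, 3) - y ** sp.Rational(2, 3)
B = sp.sin(x) * sp.cos(x)
G = sp.log(1 + A**2 / (gamma + B**2))

# One of the partial derivatives, to look at alongside the 3D plot below
print(sp.simplify(sp.diff(G, y)))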

We can plot this function using plotly:

Code
import plotly.graph_objects as go
import requests as rq
import json
from pathlib import Path

# Layout settings shared by this site's figures, loaded from a local template file
template = json.loads(
    Path("../plotlyTemplate.json").read_text()
)

# Fetch the precomputed 3D surface of G(x, y) as a plotly figure spec
response = rq.get("https://storage.googleapis.com/open.data.arnov.dev/static/plots/optimization/CostFunction2D.json")

# Build an interactive figure from the downloaded spec and apply the template
fig = go.FigureWidget(data=response.json())

fig.update_layout(
    **template,
)

fig.show()

A 3D plot of the function G(x,y) = \log\left(1 + \dfrac{A(x,y)^2}{\gamma + B(x,y)^2}\right)


  1. in the sense of a norm \lvert \cdot \rvert: \varepsilon(h)/\lvert h\rvert\to 0 as \lvert h\rvert\to 0.↩︎

  2. There is a subtlety here, because we can’t really let just any h_i go to zero before the others arbitrarily; but if we do it in any order, using the most common norms (here we’ll use the Euclidean norm \lvert \cdot \rvert), the limit exists and is unique.↩︎